{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T10:08:18.181363Z",
     "start_time": "2020-11-15T10:08:18.174241Z"
    }
   },
   "source": [
    "<ul id=\"breadcrumb\">\n",
    "<li><a href=\"#第4章-朴素贝叶斯\">&nbsp;</a></li>\n",
    "<li><a href=\"#朴素贝叶斯分类算法\">NaiveBayes</a></li>\n",
    "<li><a href=\"#课本例4.1\">例4.1</a></li>\n",
    "<li><a href=\"#GaussianNB-高斯朴素贝叶斯\">GaussianNB</a></li>\n",
    "<li><a href=\"#scikit-learn实例\">scikit实例</a></li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第4章 朴素贝叶斯"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1．朴素贝叶斯法是典型的生成学习方法。生成方法由训练数据学习联合概率分布\n",
    "$P(X,Y)$，然后求得后验概率分布$P(Y|X)$。具体来说，利用训练数据学习$P(X|Y)$和$P(Y)$的估计，得到联合概率分布：\n",
    "\n",
    "$$P(X,Y)＝P(Y)P(X|Y)$$\n",
    "\n",
    "概率估计方法可以是极大似然估计或贝叶斯估计。\n",
    "\n",
    "2．朴素贝叶斯法的基本假设是条件独立性，\n",
    "\n",
    "$$\\begin{aligned} P(X&=x | Y=c_{k} )=P\\left(X^{(1)}=x^{(1)}, \\cdots, X^{(n)}=x^{(n)} | Y=c_{k}\\right) \\\\ &=\\prod_{j=1}^{n} P\\left(X^{(j)}=x^{(j)} | Y=c_{k}\\right) \\end{aligned}$$\n",
    "\n",
    "\n",
    "这是一个较强的假设。由于这一假设，模型包含的条件概率的数量大为减少，朴素贝叶斯法的学习与预测大为简化。因而朴素贝叶斯法高效，且易于实现。其缺点是分类的性能不一定很高。\n",
    "\n",
    "3．朴素贝叶斯法利用贝叶斯定理与学到的联合概率模型进行分类预测。\n",
    "\n",
    "$$P(Y | X)=\\frac{P(X, Y)}{P(X)}=\\frac{P(Y) P(X | Y)}{\\sum_{Y} P(Y) P(X | Y)}$$\n",
    " \n",
    "将输入$x$分到后验概率最大的类$y$。\n",
    "\n",
    "$$y=\\arg \\max _{c_{k}} P\\left(Y=c_{k}\\right) \\prod_{j=1}^{n} P\\left(X_{j}=x^{(j)} | Y=c_{k}\\right)$$\n",
    "\n",
    "后验概率最大等价于0-1损失函数时的期望风险最小化。\n",
    "\n",
    "\n",
    "模型：\n",
    "\n",
    "- 高斯模型\n",
    "- 多项式模型\n",
    "- 伯努利模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 朴素贝叶斯分类算法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:20:41.109036Z",
     "start_time": "2020-11-15T11:20:41.096745Z"
    }
   },
   "outputs": [],
   "source": [
    "class NB:\n",
    "\n",
    "    def __init__(self, lambda_=1):\n",
    "        self.lambda_ = lambda_\n",
    "        self.classes_ = None\n",
    "        self.prior_ = None\n",
    "        self.class_prior_ = None\n",
    "        self.class_count_ = None\n",
    "\n",
    "    def fit(self, X, y, debug=False):\n",
    "        self.classes_ = np.unique(y)\n",
    "        # to df\n",
    "        X = pd.DataFrame(X)\n",
    "        y = pd.DataFrame(y)\n",
    "\n",
    "        self.class_count_ = y[y.columns[0]].value_counts()\n",
    "        self.class_prior_ = self.class_count_/y.shape[0]\n",
    "        if debug:\n",
    "            print(\"P(Y={})={}/{}\".format(self.classes_[0],self.class_count_[0],y.shape[0]))\n",
    "            print(\"P(Y={})={}/{}\".format(self.classes_[1],self.class_count_[1],y.shape[0]))\n",
    "            \n",
    "        # prior\n",
    "        self.prior_ = dict()\n",
    "        for idx in X.columns:\n",
    "            for j in self.classes_:\n",
    "                p_x_y = X[(y == j).values][idx].value_counts()\n",
    "                for i in p_x_y.index:\n",
    "                    self.prior_[(idx, i, j)] = p_x_y[i]/self.class_count_[j]\n",
    "                    if debug:\n",
    "                        print(\"P(X^{}={}|Y={})={}/{}\".format(idx,i,j,p_x_y[i],self.class_count_[j]))\n",
    "\n",
    "    def predict(self, X, debug=False):\n",
    "        rst = []\n",
    "        for class_ in self.classes_:\n",
    "            if debug:\n",
    "                msg=\"P(Y={})\".format(class_)\n",
    "            py = self.class_prior_[class_]\n",
    "            pxy = 1\n",
    "            for idx, x in enumerate(X):\n",
    "                pxy *= self.prior_[(idx, x, class_)]\n",
    "                if debug:\n",
    "                    msg +=\"P(X^{}={}|Y={})\".format(idx,x,class_)\n",
    "\n",
    "            rst.append(py*pxy)\n",
    "            if debug:\n",
    "                msg += \"={:.2f}\".format(rst[-1])\n",
    "                print(msg)\n",
    "                \n",
    "        return self.classes_[np.argmax(rst)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 课本例4.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T10:41:44.550500Z",
     "start_time": "2020-11-15T10:41:44.528189Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-1\n"
     ]
    }
   ],
   "source": [
    "data = pd.read_csv(\"data_4-1.txt\", header=None, sep=\",\")\n",
    "X = data[data.columns[0:2]]\n",
    "y = data[data.columns[2]]\n",
    "clf = NB(lambda_=1)\n",
    "clf.fit(X, y)\n",
    "rst = clf.predict([2, \"S\"])\n",
    "print(rst)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:23:51.718237Z",
     "start_time": "2020-11-15T11:23:51.694240Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "P(Y=ab)=9/15\n",
      "P(Y=ok)=6/15\n",
      "P(X^0=1|Y=ab)=3/6\n",
      "P(X^0=2|Y=ab)=2/6\n",
      "P(X^0=3|Y=ab)=1/6\n",
      "P(X^0=3|Y=ok)=4/9\n",
      "P(X^0=2|Y=ok)=3/9\n",
      "P(X^0=1|Y=ok)=2/9\n",
      "P(X^1=S|Y=ab)=3/6\n",
      "P(X^1=M|Y=ab)=2/6\n",
      "P(X^1=L|Y=ab)=1/6\n",
      "P(X^1=M|Y=ok)=4/9\n",
      "P(X^1=L|Y=ok)=4/9\n",
      "P(X^1=S|Y=ok)=1/9\n",
      "P(Y=ab)P(X^0=2|Y=ab)P(X^1=S|Y=ab)=0.07\n",
      "P(Y=ok)P(X^0=2|Y=ok)P(X^1=S|Y=ok)=0.02\n",
      "ab\n"
     ]
    }
   ],
   "source": [
    "data = pd.read_csv(\"data_4-2.txt\", header=None, sep=\",\")\n",
    "X = data[data.columns[0:2]]\n",
    "y = data[data.columns[2]]\n",
    "clf = NB(lambda_=0)\n",
    "clf.fit(X, y, debug=True)\n",
    "rst = clf.predict([2, \"S\"], debug=True)\n",
    "print(rst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GaussianNB 高斯朴素贝叶斯\n",
    "\n",
    "### iris数据集\n",
    "iris数据集中两个分类的数据和[sepal length，sepal width]作为特征\n",
    "![Sepal vs. Petal](images/sepal_petal.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:31:01.463460Z",
     "start_time": "2020-11-15T11:31:00.650241Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from collections import Counter\n",
    "import math"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:31:01.513809Z",
     "start_time": "2020-11-15T11:31:01.508251Z"
    }
   },
   "outputs": [],
   "source": [
    "# data\n",
    "def create_data():\n",
    "    iris = load_iris()\n",
    "    df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
    "    df['label'] = iris.target\n",
    "    df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']\n",
    "    data = np.array(df.iloc[:100, :])\n",
    "    # print(data)\n",
    "    return data[:,:-1], data[:,-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:50:57.644000Z",
     "start_time": "2020-11-15T11:50:57.634211Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集占70%，测试集占30%\n",
      "测试样本：x=[5.5 3.5 1.3 0.2] y=0.0\n"
     ]
    }
   ],
   "source": [
    "X, y = create_data()\n",
    "t_size=0.3\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_size)\n",
    "print(\"训练集占{}%，测试集占{}%\".format(int((1-t_size)*100), int(t_size*100)))\n",
    "print(\"测试样本：x={} y={}\".format(X_test[0], y_test[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "参考：https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/\n",
    "\n",
    "### GaussianNB 高斯朴素贝叶斯\n",
    "\n",
    "特征的可能性被假设为高斯\n",
    "\n",
    "概率密度函数：\n",
    "$$P(x_i | y_k)=\\frac{1}{\\sqrt{2\\pi\\sigma^2_{yk}}}exp(-\\frac{(x_i-\\mu_{yk})^2}{2\\sigma^2_{yk}})$$\n",
    "\n",
    "数学期望(mean)：$\\mu$\n",
    "\n",
    "方差：$\\sigma^2=\\frac{\\sum(X-\\mu)^2}{N}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:30:52.678883Z",
     "start_time": "2020-11-15T11:30:52.665239Z"
    }
   },
   "outputs": [],
   "source": [
    "class NaiveBayes:\n",
    "    def __init__(self):\n",
    "        self.model = None\n",
    "\n",
    "    # 数学期望\n",
    "    @staticmethod\n",
    "    def mean(X):\n",
    "        return sum(X) / float(len(X))\n",
    "\n",
    "    # 标准差（方差）\n",
    "    def stdev(self, X):\n",
    "        avg = self.mean(X)\n",
    "        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))\n",
    "\n",
    "    # 概率密度函数\n",
    "    def gaussian_probability(self, x, mean, stdev):\n",
    "        exponent = math.exp(-(math.pow(x - mean, 2) /\n",
    "                              (2 * math.pow(stdev, 2))))\n",
    "        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent\n",
    "\n",
    "    # 处理X_train\n",
    "    def summarize(self, train_data):\n",
    "        summaries = [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]\n",
    "        return summaries\n",
    "\n",
    "    # 分类别求出数学期望和标准差\n",
    "    def fit(self, X, y):\n",
    "        labels = list(set(y))\n",
    "        data = {label: [] for label in labels}\n",
    "        for f, label in zip(X, y):\n",
    "            data[label].append(f)\n",
    "        self.model = {\n",
    "            label: self.summarize(value)\n",
    "            for label, value in data.items()\n",
    "        }\n",
    "        return 'gaussianNB train done!'\n",
    "\n",
    "    # 计算概率\n",
    "    def calculate_probabilities(self, input_data):\n",
    "        # summaries:{0.0: [(5.0, 0.37),(3.42, 0.40)], 1.0: [(5.8, 0.449),(2.7, 0.27)]}\n",
    "        # input_data:[1.1, 2.2]\n",
    "        probabilities = {}\n",
    "        for label, value in self.model.items():\n",
    "            probabilities[label] = 1\n",
    "            for i in range(len(value)):\n",
    "                mean, stdev = value[i]\n",
    "                probabilities[label] *= self.gaussian_probability(\n",
    "                    input_data[i], mean, stdev)\n",
    "        return probabilities\n",
    "\n",
    "    # 类别\n",
    "    def predict(self, X_test):\n",
    "        # {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}\n",
    "        label = sorted(\n",
    "            self.calculate_probabilities(X_test).items(),\n",
    "            key=lambda x: x[-1])[-1][0]\n",
    "        return label\n",
    "\n",
    "    def score(self, X_test, y_test):\n",
    "        right = 0\n",
    "        for X, y in zip(X_test, y_test):\n",
    "            label = self.predict(X)\n",
    "            if label == y:\n",
    "                right += 1\n",
    "\n",
    "        return right / float(len(X_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:50:13.098803Z",
     "start_time": "2020-11-15T11:50:13.093028Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "预测类别：0.0 分数：1.0\n"
     ]
    }
   ],
   "source": [
    "model = NaiveBayes()\n",
    "model.fit(X_train, y_train)\n",
    "print(\"预测类别：{:.1f} 分数：{:.1f}\".format(model.predict([4.4,  3.2,  1.3,  0.2]), model.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### scikit-learn实例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:52:32.053003Z",
     "start_time": "2020-11-15T11:52:32.046317Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import GaussianNB"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:52:34.964415Z",
     "start_time": "2020-11-15T11:52:34.956095Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GaussianNB(priors=None, var_smoothing=1e-09)"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = GaussianNB()\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-15T11:56:19.200683Z",
     "start_time": "2020-11-15T11:56:19.194521Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "预测类别：[0.] 分数：1.0\n"
     ]
    }
   ],
   "source": [
    "print(\"预测类别：{} 分数：{}\".format(clf.predict([[4.4,  3.2,  1.3,  0.2]]), clf.score(X_test, y_test)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import BernoulliNB, MultinomialNB # 伯努利模型和多项式模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "----\n",
    "参考代码：https://github.com/wzyonggege/statistical-learning-method\n",
    "\n",
    "中文注释制作：机器学习初学者\n",
    "\n",
    "微信公众号：ID:ai-start-com\n",
    "\n",
    "配置环境：python 3.5+\n",
    "\n",
    "代码全部测试通过。\n",
    "![gongzhong](images/gongzhong.jpg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
